Abstract: Big data integration could involve a large number
of sources with unpredictable redundant information between
them. Building a central warehouse to integrate big data from
all sources then becomes infeasible because of the large number
of sources and the continuous updates.
A practical approach is to apply online query scheduling that
inquires data from sources at runtime upon receiving a query.
In this paper, we address the Time-Cost Minimization Problem
for online query scheduling, and tackle the challenges of source
permutation and statistics estimation to minimize the time cost of
retrieving answers for a query received in real time. We propose
an online scheduling strategy in which the improvement
of statistics, the construction of the source permutation and the
execution of the query work in parallel. Experimental results show
the high efficiency and scalability of our scheduling strategy.

I. Introduction 

Big data not only means a large volume of data, but also
implies variety, in that data can come from a large number of
sources. The integration of big data, which unifies varieties of
data from widely distributed sources, can act as the foundation
of information conformity for applications. Big data
integration differs from traditional data integration in two
major ways. First, continuous updates may occur
in a fair number of data sources. As a consequence, the
traditional way of building a central warehouse to integrate
data from sources becomes infeasible. A practical way
is to apply online scheduling that inquires data from sources
at runtime upon receiving a query. Second, a large proportion
of unpredictable redundant data exists between sources. Thus,
the query results returned from different sources should be
compared, and the repetitive ones removed. To deal with these two
challenges, an intelligent online query scheduling could be
designed to answer the query with non-redundant results in
the least possible time.

In this paper, we set out to address this online query
scheduling problem. In its scenario, as illustrated in Figure 1,
the querier, which does not cache any data itself, and the domains
that act as data sources are independent of each other. For query
scheduling, the querier may first arrange the domains in a
permutation. Then, upon receiving a query, the querier sequentially
inquires each domain (or a few domains in parallel) at a
time following the permutation. When obtaining new results
from a domain, the querier compares them with the results
previously received from other domains, and removes the repetitive
ones. Finally, the querier returns the results after receiving
enough of them. We also found that the scenario depicted in
Figure 1 is common in research and practical applications. In
applications, this style of online query on multiple sources is
usually seen in aggregators. For example, Google News, a news
aggregator, watches updates from more than 4500 worldwide
news sources, and exhibits non-redundant news to the readers;
KAYAK.com, a trip aggregator, shows similar trips obtained
from hundreds of travel sites. More aggregator applications can
be found in [?]. Besides, in domain-centric research, related
studies mainly focus on integrating the knowledge about a
topic from widespread sources. As shown in the experiment
of [?], about 5000 sources need to be queried to
acquire 95% of the knowledge of a topic, due to a large portion of
redundant knowledge. Similar results on other topics are
given in [?].

Fig. 1. The scenario of online query scheduling for integration

In our experience of online scheduling, the critical point for
reducing query time lies in the
permutation of sources. Each data source has a different
access time, a different transfer time, and a different proportion
of intersection data (namely redundant information) with other
sources. Choosing a good permutation of sources can reduce
the total time cost for a query, and the reduction is much more
significant for a long-running query on a large
number of sources. To better describe the permutation choice,
we give a simple example on a query $Q_k$ (namely a query $Q$
for acquiring $k$ tuples) here:

Example 1. Consider three sources $S_1$, $S_2$ and $S_3$ shown
in Figure 2. For simplicity, set the access time $ta_1 = ta_2 =
ta_3 = 0$ and the transfer time of retrieving a tuple $tr_1 =
0.7$ ms, $tr_2 = 1.1$ ms and $tr_3 = 1.5$ ms for $S_1$, $S_2$ and $S_3$
respectively. Upon receiving a query $Q_k$, the querier starts to
estimate the statistics of the sources, and obtains the estimates
that $S_1$, $S_2$ and $S_3$ have $|S_1| = 50$, $|S_2| = 125$ and
$|S_3| = 75$ result tuples for $Q$ respectively, with intersection
tuples $|S_1 \cap S_2| = 35$, $|S_1 \cap S_3| = 5$, $|S_2 \cap S_3| = 10$ and
$|S_1 \cap S_2 \cap S_3| = 0$. Then, the querier has the following optimal
permutations $\Pi_{opt}$ on time cost (permutations $S_1$ and $S_1S_2$
can be seen in curve $S_1S_2S_3$, and permutations $S_2$, $S_2S_3$ and
$S_2S_3S_1$ can be seen in curve $S_2S_3S_1$ in Figure 3):

$$\Pi_{opt} = \begin{cases}
S_1 & 0 < k \le 50 \\
S_1S_2 & 50 < k \le 96 \\
S_2 & 96 < k \le 125 \\
S_2S_3 & 125 < k \le 190 \\
S_2S_3S_1 & 190 < k \le 200
\end{cases} \qquad (1)$$

Fig. 2. Venn diagram of $S_1$, $S_2$ and $S_3$

Fig. 3. Time cost of the permutations (axis label: Tuples)

Fig. 4. The architecture of online query scheduling



For instance, suppose that the querier receives a query $Q$
of 125 tuples and starts only one thread to process it. The
optimal permutation $S_2$ has the query time cost $tr_2|S_2| =
137.5$ ms. In contrast, given another permutation $S_1S_2$, only
50 tuples are obtained after querying $S_1$; then, the querier
starts to inquire $S_2$ for another $125 - |S_1| = 75$ tuples, and
the expected time of receiving a non-repetitive tuple from $S_2$ is
$\frac{tr_2|S_2|}{|S_2| - |S_1 \cap S_2|} \approx 1.5$ ms; finally, following permutation $S_1S_2$,
the querier returns the result of 125 tuples with time cost
$tr_1|S_1| + 1.5(125 - |S_1|) \approx 147.5$ ms, greater than the optimal
time cost 137.5 ms.
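
The arithmetic of Example 1 can be checked with a short script. The following is a minimal sketch rather than the paper's implementation: the names `SOURCES`, `REGIONS`, `time_cost` and `best_prefix` are hypothetical, the Venn regions are derived from the statistics of Example 1, and new tuples are assumed to arrive uniformly while a source is being read. Because it uses the exact expected time per non-repetitive tuple ($137.5/90 \approx 1.53$ ms) instead of the rounded 1.5 ms, it yields about 149.6 ms for $S_1S_2$ at $k = 125$ rather than 147.5 ms.

```python
from itertools import permutations

# Example 1 statistics: access time ta (ms), per-tuple transfer time tr (ms), size |S_i|.
SOURCES = {
    "S1": {"ta": 0.0, "tr": 0.7, "size": 50},
    "S2": {"ta": 0.0, "tr": 1.1, "size": 125},
    "S3": {"ta": 0.0, "tr": 1.5, "size": 75},
}
# Venn regions (Fig. 2): number of tuples held by exactly this set of sources.
REGIONS = {
    frozenset({"S1"}): 10,
    frozenset({"S2"}): 80,
    frozenset({"S3"}): 60,
    frozenset({"S1", "S2"}): 35,
    frozenset({"S1", "S3"}): 5,
    frozenset({"S2", "S3"}): 10,
    frozenset({"S1", "S2", "S3"}): 0,
}


def time_cost(perm, k):
    """Expected time (ms) to collect k distinct tuples; returns (time, prefix used)."""
    elapsed, got, seen = 0.0, 0, set()
    for pos, name in enumerate(perm, start=1):
        src = SOURCES[name]
        # New tuples: regions containing this source that no earlier source covered.
        new = sum(n for region, n in REGIONS.items()
                  if name in region and not (region & seen))
        full = src["ta"] + src["tr"] * src["size"]
        if new > 0 and got + new >= k:
            # Assumption: new tuples arrive uniformly while the source is read.
            partial = src["ta"] + (k - got) / new * src["tr"] * src["size"]
            return elapsed + partial, tuple(perm[:pos])
        elapsed, got = elapsed + full, got + new
        seen.add(name)
    return float("inf"), tuple(perm)  # fewer than k distinct tuples exist


def best_prefix(k):
    """Exhaustive search over all orderings (feasible only for a few sources)."""
    return min((time_cost(p, k) for p in permutations(SOURCES)), key=lambda r: r[0])


if __name__ == "__main__":
    for k in (40, 80, 110, 150):
        print(k, best_prefix(k))
```

For $k$ = 40, 80, 110 and 150 the search selects the prefixes $S_1$, $S_1S_2$, $S_2$ and $S_2S_3$, in agreement with (1).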

We name the addressed problem the Time-Cost Minimization
Problem (TMP, formally defined in Section III-A). In
implementation, TMP has two major difficulties: statistics
estimation (about the intersection between sources) and optimal
permutation based on statistics. Concerning these two
difficulties, we make the following contributions¹:

• We prove that TMP is NP-complete, and propose the
OnlinePerm algorithm, which constructs a permutation
$\Pi(Y)$ whose time cost $T(Q_k(\Pi(Y)))$ is within a
provable approximation ratio (Theorem 2) of the optimal
time cost $T(Q_k(\Pi_{opt}(Y)))$ for a query $Q$ retrieving
$k$ tuples.

• We present the mechanism of two-stage detection for
statistics collection. In particular, to avoid the exponentially
growing complexity of detection when the number of
sources increases, the statistics collection mechanism
applies pruning techniques that only probe the critical
statistics on-the-fly.

• We propose the online scheduling strategy that enables
the improvement of statistics collection, the permutation
construction of the OnlinePerm algorithm and the execution
of the query on sources to work in parallel, so as to reduce
the total time cost for the query.

• We conduct experiments on physical sources. The
experimental results show that our scheduling strategy is scalable
and can significantly reduce the time cost.

¹ For a quick understanding of our paper, please refer to Section [?]
(Figure 1) and Section VI, Online Query Strategy (Figure [?]).

The rest of this paper is organized as follows. Section II
presents the related work. Section III gives an overview of
online query scheduling. Section IV presents the OnlinePerm
algorithm for source permutation. Section V proposes the
mechanism of two-stage detection for statistics collection.
Section VI describes the scheduling strategy. Section VII
presents the experimental results. Section VIII concludes this
paper.


II. Related Work 

There exists some research work [?]
that considers intersection or redundancy information for
source selection. In [?], sources were classified into domains
(e.g., journal paper, conference paper), and probabilistic
intersections between domains were considered to select sources
with more answers. In [?], the challenges of calculating the
exponential number of source intersections were discussed
to provide useful knowledge for source permutation. In [?],
the StatMiner system was proposed to learn the union and
intersection statistics between classes (including sources and
queries) and provide static selection of sources with more
relevant answers for each class. In [?], the maximum number
of duplicates was found between sources to answer the Join
queries for entities. In [?], the copying relationships between
sources known in advance were considered to determine the
permutation of sources with the targets of cost minimization
and maximum coverage. In [?], possible intersections or
duplicates between sources were studied to estimate the expected
value of data items, so that a data integration system could
construct a permutation of sources by the accuracy of data
items. In [?], the answers for a query were estimated by the
statistics (including intersection knowledge) obtained by other
queries for the crowdsourced system, and thus a permutation
of sources can be given by sorting the sources by the number
of answer tuples in descending order. In [?], the OASIS system
was proposed to collect intersection statistics online and find a
permutation of sources that maximizes the Area-under-the-curve
value for retrieving all answer tuples from sources.

Summary of differences: Our work differs from the
related literature in the following aspects: (1) we take a
query $Q$ of $k$ tuples received in real time as input, (2) we focus on the
Time-Cost Minimization Problem for source permutation, (3)
we propose a two-stage detection mechanism that combines
the offline and online collection processes to estimate
intersection statistics between sources, and (4) we present the online
scheduling strategy that enables source permutation, statistics
collection and query execution to work in parallel.


III. Overview 


A. Model and Problem 


Given a set of sources $Y = \{S_1, S_2, \ldots, S_l\}$ and a
permutation $\Pi(Y)$ for a query $Q$, the query rate $v_i$ of a source $S_i \in Y$
can be written as follows:

$$v_i = \frac{|S_i| - |\cap S_i|}{ta_i + tr_i|S_i|} \qquad (2)$$

In (2), $|S_i|$ is the total number of result tuples for $Q$ in source
$S_i$; $|\cap S_i|$ is the number of intersection tuples that have already been transferred
to the querier from other sources (that is, these intersection
tuples are in $|S_j \cap S_i|$, where $S_j$ is prior to $S_i$ in $\Pi(Y)$); $ta_i$
and $tr_i$ denote the access time and the transfer time of a tuple
from $S_i$ to the querier respectively.

Suppose that the permutation $\Pi(Y)$ has $\Pi(Y) =
S_{(1)}S_{(2)} \ldots S_{(i_k)}$, with $\sum_{i=1}^{i_k-1}(|S_{(i)}| - |\cap S_{(i)}|) < k \le
\sum_{i=1}^{i_k}(|S_{(i)}| - |\cap S_{(i)}|)$, and that only one query thread is running in the querier.
The permutation then has an average rate $v_{avg}$ and
a time cost $T(Q_k(\Pi(Y)))$. The optimal permutation
$\Pi_{opt}(Y)$ has the least time cost compared to any other
permutation. We define the Time-Cost Minimization Problem
as follows:

Definition 1 (Time-Cost Minimization Problem (TMP)).
Given a query $Q$ of $k$ tuples and a set $Y = \{S_1, S_2, \ldots, S_l\}$
of sources, find the optimal permutation $\Pi_{opt}(Y)$ of sources,
having time cost $T(Q_k(\Pi_{opt}(Y))) \le T(Q_k(\Pi(Y)))$ for any
other $\Pi(Y)$.

B. Architecture 

The architecture of online query scheduling (Figure 4)
mainly contains three components:

• Source Permutation (SP): Upon receiving a query $Q_k$, SP
repeatedly improves the permutation $\Pi(Y)$ for $Q_k$ based
on the continuously collected statistics provided by the
Statistics Collection (SC) component. The permutation
$\Pi(Y) = S_{(1)}S_{(2)} \ldots S_{(i_k)}$, with $\sum_{i=1}^{i_k-1}(|S_{(i)}| - |\cap S_{(i)}|) < k \le
\sum_{i=1}^{i_k}(|S_{(i)}| - |\cap S_{(i)}|)$, contains sources having a total of
no less than $k$ tuples for $Q_k$, and the set $Y_e$ contains the
remaining unselected sources.

• Statistics Collection (SC): SC collects the statistics from
all sources. Its collection process is divided into two
stages. In the Initial-detection stage, SC generates a query
$Q$ that retrieves all tuples from all sources (or samples
from all sources) in $Y$ to obtain general statistics. Then,
the statistics of any other query can be regarded as a
subset of the statistics of $Q$. When receiving the query
$Q_k$, SC starts the Online-detection stage. In this stage, the
statistics of $Q_k$ are first estimated from the statistics of $Q$,
and then continuously improved by the online detection
results on sources.

• Query Execution (QE): QE runs the query threads to
retrieve the result tuples for $Q_k$ from sources
following the permutation $\Pi(Y)$ constructed by SP. When
the querier has already received $k$ tuples for $Q_k$ (or
all sources have been queried), QE sends a signal to
terminate the running processes in SP and SC.

Next, we present SP and SC in Sections IV and V respectively.
Then, in Section VI, we describe the scheduling strategy that
enables SP, SC and QE to work in parallel.


IV. Source Permutation for Minimal Time Cost 

In this section, we first prove that TMP is NP-complete,
and then provide two observations that reveal the
correlation between the query rate and the residual tuples.
Finally, we propose the OnlinePerm algorithm for source
permutation based on these two observations.

Theorem 1. TMP is NP-complete.

Proof: We prove the NP-hardness of TMP with the Set
Cover Problem. Given a universal set and a set of subsets
whose union equals the universal set, the Set Cover Problem
is to find the smallest number $m$ of subsets whose union equals
the universal set. The Set Cover Problem can be instantiated as
follows. Suppose that the universal set $Y'$ contains $|Y| + |Y|^2$
elements, $|Y|$ of which are the union of tuples for query $Q$
from $S_i \in Y$, $i = 1, 2, \ldots, l$, and the other $|Y|^2$ of which are
newly generated tuples for query $Q$. Create $l$ subsets
$S'_1, S'_2, \ldots, S'_l$ of $Y'$. Each subset $S'_i$ contains $|S_i| + |Y|^2$ elements
that are the tuples for query $Q$ from source $S_i$ and the newly
generated ones.

For a query $Q$ of $|Y| + |Y|^2$ tuples, if there exists an
optimal permutation $\Pi_{opt}(Y)$ of $m$ sources, we can easily
see that the union of all elements in these sources covers all
the elements of the universal set. Thus, the reduction from
TMP to Set Cover Problem is established. In turn, suppose
that there exists a smallest number $m$ of subsets $S'_{c(1)}$, $S'_{c(2)}$,
..., $S'_{c(m)}$ whose union equals the universal set $Y'$. We first
construct a permutation $\Pi(Y)$ by randomly arranging these $m$
subsets, and set $ta_i = 0$ ms and $tr_i = 1$ ms for $\forall S_i \in Y$.
Then, we have the time cost for query $Q$ of $|Y| + |Y|^2$ tuples:

$$T(Q_{|Y|+|Y|^2}(\Pi(Y))) \le m(|Y| + |Y|^2) \qquad (3)$$

Consider any other permutation $\Pi'(Y)$ with $m + 1$ subsets:

$$T(Q_{|Y|+|Y|^2}(\Pi'(Y))) \ge (m + 1)|Y|^2 \qquad (4)$$

For this instance of TMP, we have $T(Q_{|Y|+|Y|^2}(\Pi(Y))) <
T(Q_{|Y|+|Y|^2}(\Pi'(Y)))$. Hence, the optimal permutation
$\Pi_{opt}(Y)$ with $m$ subsets exists if the union of these $m$
subsets covers the universal set.

The decision version of TMP is to decide the query time
cost for a given permutation $\Pi(Y)$. The total time cost for
$\Pi(Y)$ can be calculated in $O(l)$ time. Therefore, TMP is
NP-complete. ■

Since TMP is NP-complete, the basic solution of traversing
all permutations of sources soon becomes unmanageable
when the number of sources increases. We consider an alternative,
scalable algorithm to construct the permutation for TMP.
Naturally, to incrementally construct the permutation $\Pi(Y)$,
a greedy algorithm can sequentially choose
the source $S_i$ that has the fastest query rate
$v_i = \frac{|S_i| - |\cap S_i|}{ta_i + tr_i|S_i|}$
at each iteration. For the example of Figure 2, source $S_1$ is
first selected since its query rate $v_1 = 1.43$ tuple/ms is the
fastest among the three sources. Afterward, source $S_2$ is
selected with $v_2 = \frac{125 - 35}{125 \times 1.1} = 0.65$ tuple/ms. Finally, source
$S_3$ is chosen with $v_3 = \frac{75 - 15}{75 \times 1.5} = 0.53$ tuple/ms. Another
approach is to examine the residual tuples of each source
and choose the source with the maximal residual tuples at each
iteration. For the example of Figure 2, source $S_2$ is first chosen
since it has the maximal residual tuples $|S_2| = 125$. Then,
sources $S_3$ and $S_1$ are selected sequentially with residual tuples
of 65 and 10 respectively. However, as shown in Example 1,
neither the permutation $S_1S_2S_3$, obtained by greedily considering
the fastest query rate, nor $S_2S_3S_1$, obtained by greedily considering the maximal
residual tuples, is an optimal solution.
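
The two greedy orderings discussed above can be reproduced with a few lines of code. This is an illustrative sketch only: the helper names (`overlap`, `residual`, `query_rate`, `greedy`) are hypothetical, and the overlap with already chosen sources is computed as a sum of pairwise intersections, which is exact here only because the triple intersection of Example 1 is empty.

```python
# Statistics of Example 1 (access time 0 for every source).
SIZE = {"S1": 50, "S2": 125, "S3": 75}
TA = {"S1": 0.0, "S2": 0.0, "S3": 0.0}
TR = {"S1": 0.7, "S2": 1.1, "S3": 1.5}
PAIR = {
    frozenset(("S1", "S2")): 35,
    frozenset(("S1", "S3")): 5,
    frozenset(("S2", "S3")): 10,
}


def overlap(name, chosen):
    """Tuples of `name` already delivered by the sources in `chosen`.

    Summing pairwise intersections is valid only because |S1 n S2 n S3| = 0.
    """
    return sum(PAIR[frozenset((name, c))] for c in chosen)


def residual(name, chosen):
    """Residual (not yet delivered) tuples of `name`."""
    return SIZE[name] - overlap(name, chosen)


def query_rate(name, chosen):
    """Query rate of Eq. (2): residual tuples per millisecond."""
    return residual(name, chosen) / (TA[name] + TR[name] * SIZE[name])


def greedy(score):
    """Repeatedly append the unselected source with the highest score."""
    chosen, remaining = [], set(SIZE)
    while remaining:
        best = max(sorted(remaining), key=lambda s: score(s, chosen))
        chosen.append(best)
        remaining.remove(best)
    return chosen


if __name__ == "__main__":
    print(greedy(query_rate))  # ordering by fastest query rate
    print(greedy(residual))    # ordering by maximal residual tuples
```

Greedy selection by query rate yields $S_1S_2S_3$ and greedy selection by residual tuples yields $S_2S_3S_1$, the two curves compared in Figure 3.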

We therefore investigate the effect of both the query rate and the
residual tuples on source permutation for TMP. As can be
seen in Figure 3, curve $S_1S_2S_3$ returns more tuples than
curve $S_2S_3S_1$ within the same time before the crosspoint
(96.8, 106.4), and the result reverses after the crosspoint. For
a given query, we could find a better permutation by
discovering possible crosspoints. Next, we give two observations
that reveal how the query rate and the residual tuples affect the
appearance of crosspoints.

Observation 1. The crosspoint of two permutation curves
appears when a swap happens between a source with a faster
query rate and another source with more residual tuples.

In Figure 5(a), source $S_1$ has a faster query rate and fewer
residual tuples compared to source $S_2$ at time $t$. The query
rate of the curve $\Pi_1(Y)$ slows down after $S_1$ has been queried,
while the query rate of the curve $\Pi_2(Y)$ is maintained and catches up with
$\Pi_1(Y)$, since $S_2$ has more tuples than $S_1$. The
crosspoint of curves $\Pi_1(Y)$ and $\Pi_2(Y)$ exists when $S_2$ has
sufficiently more tuples and only a slightly lower query rate than $S_1$.

Observation 2. The crosspoint of two permutation curves does
not exist when a swap happens between two sources that have
few intersection tuples.

For the example in Figure 5(b), the query rate $v_2$ of $S_2$
differs between $\Pi_1(Y)$ and $\Pi_2(Y)$ only through the
intersection tuples that $S_2$ shares with $S_1$.
If source $S_1$ has few intersection tuples with
$S_2$, that is, $|S_1 \cap S_2|$ has a small value, we have $v_2(\Pi_1(Y)) \approx
v_2(\Pi_2(Y))$, and likewise $v_1(\Pi_1(Y)) \approx v_1(\Pi_2(Y))$ for $S_1$. In this case,
the crosspoint of these two curves does not exist.

Observations 1 and 2 indicate heuristics for designing
the Online Permutation (OnlinePerm) algorithm. Before
describing it, we first present the inputs, functions and
algorithms on which the OnlinePerm algorithm is built. Given
a query $Q$ of $k$ tuples, the following inputs provide essential
information for the construction of the source permutation:

• Statistics Input $SI$: The statistics are from the set $Y =
\{S_1, S_2, \ldots, S_l\}$ of sources, including the access time cost
$ta_i$, the time cost $tr_i$ of querying a tuple, the number of tuples
$|S_i|$ answering the given query $Q$ and the number of
intersection tuples $|\cap S_i|$ for any permutation of sources,
$i = 1, 2, \ldots, l$. (Let $|\cap Y|$ denote $|\cap S_i|$ for all
$S_i \in Y$. $|\cap Y|$ is unknown or imprecise upon receiving $Q$
at runtime. In Section V, we will show how to estimate
$|\cap Y|$ by two-stage detection.)

• Intersection Threshold $\theta_{sp}$: For a source $S_i$, we compute its
intersection proportion with another source $S_j$.
If this proportion is below the threshold
$\theta_{sp}$, we can ignore the source $S_j$ for $S_i$ in the permutation
construction.

We define the following functions:

• InSec: Given a permutation $\Pi(Y)$ of selected sources for
the query $Q_k$ and a set $Y_e$ of unselected sources, function
InSec returns the number of intersection tuples $|\cap S_i|$
with the sources in $\Pi(Y)$ for each $S_i \in Y_e$.

• Counter: Given a permutation $\Pi(Y)$ of selected sources
for the query $Q_k$, function Counter counts the total
number of tuples $n_{sum} = \sum_{S_i \in \Pi(Y)}(|S_i| - |\cap S_i|)$.

• Perm2Set: Supposing that $\sum_{i=1}^{i_k-1}(|S_{(i)}| - |\cap S_{(i)}|) < k \le
\sum_{i=1}^{i_k}(|S_{(i)}| - |\cap S_{(i)}|)$, the function Perm2Set removes the
$(i_k + 1)$th, $(i_k + 2)$th, ..., $|\Pi(Y)|$th sources from $\Pi(Y)$,
and adds these sources to the set $Y_e$ of unselected ones.
Then, Perm2Set returns the total number of tuples $n_{sum}$
and the average query rate $v_{avg}$ of $\Pi(Y)$.

• Set2Perm: Given the permutation $\Pi(Y)$ of selected
sources and a set $Y_e$ of unselected sources, the function
Set2Perm adds a source $S_q \in Y_e$ to the tail of $\Pi(Y)$,
removes $S_q$ from $Y_e$, and returns the total number of
tuples $n_{sum}$ and the average query rate $v_{avg}$ of $\Pi(Y)$.

• Swap: Given $S_i \in \Pi(Y)$ and $S_j \in Y_e$, the function
Swap replaces $S_i$ with $S_j$ in $\Pi(Y)$, removes the sources
ranked behind $S_i$ from $\Pi(Y)$, and adds $S_i$ and these
removed sources to $Y_e$. Then, Swap returns the new
$\Pi(Y)$ and $Y_e$.

• Sort: Given $S_i \in \Pi(Y)$ and an expression $Exp$, the
function Sort sorts the sources in $Y_e$ by the value of
$Exp$, and discards the sources with a value below the
threshold $\theta_{sp}$. Then, Sort returns the sorted permutation
$\Pi_s(Y_e)$.

• LockWrite: The function LockWrite writes $\Pi(Y)$ to
the shared memory for the query $Q_k$ with the update
lock.

Algorithm 1: Greedy on Query Rate (GreedyQR)

Input: A query $Q_k$; a permutation $\Pi(Y)$ of selected
  sources; a set $Y_e$ of unselected sources; Statistics
  Input $SI$
Output: The new permutation $\Pi(Y)$ and set $Y_e$; the
  total number of tuples $n_{sum}$ of $\Pi(Y)$; the
  average query rate $v_{avg}$ of $\Pi(Y)$

 1  [$n_{sum}$] = Counter($\Pi(Y)$, $SI$);
 2  if $n_{sum} > k$ then
 3      [$n_{sum}$, $v_{avg}$] = Perm2Set($\Pi(Y)$, $Y_e$, $SI$, $k$);
 4  else
 5      Set an empty source $S_q$ with no tuple;
 6      while $n_{sum} < k$ do
 7          Set $v_{max} = 0$;
 8          [$|\cap S_i|$, $S_i \in Y_e$] = InSec($\Pi(Y)$, $Y_e$, $SI$);
 9          foreach Data Source $S_i \in Y_e$ do
10              Calculate the query rate $v_i = \frac{|S_i| - |\cap S_i|}{ta_i + tr_i|S_i|}$;
11              if $v_i > v_{max}$ then
12                  Set $v_{max} = v_i$ and $S_q = S_i$;
13              end
14          end
15          [$n_{sum}$, $v_{avg}$] = Set2Perm($S_q$, $\Pi(Y)$, $Y_e$, $SI$);
16      end
17  end
18  return $\Pi(Y)$, $Y_e$, $n_{sum}$ and $v_{avg}$;

Algorithm 2: Re-Permutations (RePerm)

Input: A selected source $S_i$; a query $Q_k$; a permutation
  $\Pi(Y)$ of selected sources; a set $Y_e$ of unselected
  sources; the average query rate $v_{avg}$ of $\Pi(Y)$;
  Statistics Input $SI$; Intersection Threshold $\theta_{sp}$
Output: The rebuilt permutation $\Pi'(Y)$; set $Y'_e$; query
  rate $v'_{avg}$

 1  Set $v'_{avg} = v_{avg}$;
 2  [$\Pi_s(Y_e)$] = Sort($S_i$, $Y_e$, $Exp$), where $Exp$ is the intersection
    proportion of $S_j$ with $S_i$, $S_i \in \Pi(Y)$ and $S_j \in Y_e$;
 3  for $j = 1, 2, \ldots, |\Pi_s(Y_e)|$ do
 4      Fetch the $j$th source $S_j$ from $\Pi_s(Y_e)$;
 5      if $|S_i| < |S_j|$ then
 6          [$\Pi''(Y)$, $Y''_e$] = Swap($S_i$, $S_j$, $\Pi(Y)$, $Y_e$);
 7          [$\Pi''(Y)$, $Y''_e$, $n''_{sum}$, $v''_{avg}$] = GreedyQR($Q$, $k$, $\Pi''(Y)$, $Y''_e$, $SI$);
 8          if $v''_{avg} > v'_{avg}$ then
 9              Set $\Pi'(Y) = \Pi''(Y)$, $Y'_e = Y''_e$ and $v'_{avg} = v''_{avg}$;
10          end
11      end
12  end
13  return $\Pi'(Y)$, $Y'_e$ and $v'_{avg}$;

Algorithm 1, Greedy on Query Rate (GreedyQR),
invokes the function Counter to check the total number of
tuples $n_{sum}$ in $\Pi(Y)$. If $n_{sum} > k$, GreedyQR moves the extra
sources from $\Pi(Y)$ to $Y_e$ by Perm2Set; otherwise, it greedily selects
the source with the maximal query rate and adds it to the tail
of $\Pi(Y)$ by Set2Perm until $n_{sum} \ge k$.

Algorithm 2, Re-Permutations (RePerm), explores the
possible swaps between a selected source $S_i \in \Pi(Y)$ and
the sources in $Y_e$. Based on Observation 2, RePerm
first selects the sources from $Y_e$ by sorting them by their
intersection proportion with $S_i$ in descending order, and ignores a source
$S_j \in Y_e$ if this proportion is below $\theta_{sp}$. Then, RePerm only chooses a
source $S_j$ having $|S_i| < |S_j|$ to swap with $S_i$, by the knowledge
from Observation 1. After the swap, RePerm re-ranks
$S_i$, the sources that were originally ranked behind $S_i$ in $\Pi(Y)$,
and the sources in $Y_e$ by GreedyQR to construct a candidate
permutation $\Pi''(Y)$. Finally, RePerm returns the permutation
$\Pi'(Y)$ with the maximal average query rate $v'_{avg}$ among all
candidate permutations.

The OnlinePerm algorithm (Algorithm 3) is based
on the effect of both the query rate and the residual tuples.
If the given permutation $\Pi(Y)$ is empty, the OnlinePerm
algorithm first constructs the permutation $\Pi(Y)$ by GreedyQR,
which greedily adds the source with the maximal query rate
from $Y_e$ sequentially, and invokes LockWrite to write $\Pi(Y)$
to the shared memory for query $Q$. (The process of the
Query Execution component, which works in parallel with the
OnlinePerm algorithm, uses LockRead to read the permutation
$\Pi(Y)$, and queries the sources sequentially following $\Pi(Y)$.)
Since the sources closer to the head of $\Pi(Y)$
are queried first, OnlinePerm checks and swaps the sources
in $\Pi(Y)$ sequentially by RePerm from its head to its tail. As
presented above, RePerm tries to find a new permutation
$\Pi'(Y)$ based on the knowledge of the correlation between
query rate and residual tuples provided by Observations 1 and
2. Once a new permutation $\Pi'(Y)$ having the average query
rate $v'_{avg} > v_{avg}$ is found, OnlinePerm invokes LockWrite
to write $\Pi'(Y)$ to the shared memory.
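
The interplay of LockWrite and LockRead can be illustrated with a small sketch. The paper does not specify how the shared memory is implemented, so everything below is an assumption for illustration: the class `SharedPermutation`, its methods `lock_write` and `lock_read`, and the helper `next_source` are hypothetical names, and a single mutex stands in for the update lock.

```python
import threading


class SharedPermutation:
    """Shared-memory slot for the current permutation (hypothetical design)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._perm = []

    def lock_write(self, perm):
        # Counterpart of LockWrite: replace the permutation under the lock.
        with self._lock:
            self._perm = list(perm)

    def lock_read(self):
        # Counterpart of LockRead: return a snapshot copy under the lock.
        with self._lock:
            return list(self._perm)


def next_source(shared, queried):
    """Pick the first source of the latest permutation not queried yet."""
    for name in shared.lock_read():
        if name not in queried:
            return name
    return None  # every source has been queried


if __name__ == "__main__":
    shared = SharedPermutation()
    shared.lock_write(["S1", "S2", "S3"])      # initial GreedyQR result
    print(next_source(shared, set()))
    shared.lock_write(["S2", "S3", "S1"])      # improved permutation
    print(next_source(shared, {"S1"}))
```

Reading a snapshot on every step lets the query executor pick up an improved permutation as soon as it is written, without re-querying sources it has already visited.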

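Construct a permutation $\Pi_u(Y)$ that arranges sources by
query rate $v_{u(i)} = \frac{|S_{u(i)}|}{ta_{u(i)} + tr_{u(i)}|S_{u(i)}|}$ in descending order
(without considering the intersection tuples, e.g., $|\cap S_{u(i)}|$),
where $S_{u(i)}$ denotes the $i$th source in $\Pi_u(Y)$. Let $u(i_k)$
denote the number of sources that has $\sum_{i=1}^{u(i_k)}|S_{u(i)}| \ge k$ and
$\sum_{j=1}^{u(i_k)-1}|S_{u(j)}| < k$. We have the following theorem for $Q_k$:

Theorem 2. The time cost $T(Q_k(\Pi(Y)))$ of $\Pi(Y)$ constructed
by OnlinePerm is
$\frac{k\sum_{i=1}^{l}(ta_i + tr_i|S_i|)}{|Y|\sum_{i=1}^{u(i_k)}(ta_{u(i)} + tr_{u(i)}|S_{u(i)}|)}$-approximate to
the optimal time cost $T(Q_k(\Pi_{opt}(Y)))$ for TMP.

Proof: The time cost $T(Q_k(\Pi_{opt}(Y)))$ is no less than
$T(Q_k(\Pi_u(Y)))$, which ignores the intersection tuples:

$$T(Q_k(\Pi_{opt}(Y))) \ge \sum_{i=1}^{u(i_k)} (ta_{u(i)} + tr_{u(i)}|S_{u(i)}|) \qquad (5)$$

GreedyQR has the monotonicity property that $v_i \ge v_j$ for
sources $S_i$ ranked ahead of $S_j$ in permutation $\Pi(Y)$.
OnlinePerm performs as well as GreedyQR at worst, when no swap
happens. Thus, according to the monotonicity property, we
have:

$$T(Q_k(\Pi(Y))) \le \frac{k}{|Y|}\sum_{i=1}^{l}(ta_i + tr_i|S_i|) \qquad (6)$$

Combining (5) and (6):

$$\frac{T(Q_k(\Pi(Y)))}{T(Q_k(\Pi_{opt}(Y)))} \le \frac{k\sum_{i=1}^{l}(ta_i + tr_i|S_i|)}{|Y|\sum_{i=1}^{u(i_k)}(ta_{u(i)} + tr_{u(i)}|S_{u(i)}|)} \qquad (7)$$

Corollary 3. Given $ta_i \ll tr_i|S_i|$ and $tr_i \approx tr_j$ for $\forall S_i, S_j \in
Y$, the time cost $T(Q_k(\Pi(Y)))$ of $\Pi(Y)$ by OnlinePerm is
$\frac{\sum_{i=1}^{l}|S_i|}{|Y|}$-approximate to $T(Q_k(\Pi_{opt}(Y)))$ for TMP.

Proof: With the given $ta_i \ll tr_i|S_i|$ and $tr_i \approx tr_j$, the
rate of querying any number of tuples with the permutation
$\Pi_u(Y)$ is approximately a constant:

$$\frac{k}{T(Q_k(\Pi_u(Y)))} \approx \frac{1}{tr_i} \qquad (8)$$

Combining (6) and (8):

$$\frac{T(Q_k(\Pi(Y)))}{T(Q_k(\Pi_{opt}(Y)))} \le \frac{\sum_{i=1}^{l}|S_i|}{|Y|} \qquad (9)$$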


OnlinePerm sequentially checks the possible swaps between
$\forall S_i \in \Pi(Y)$ and $S_j \in Y_e$ whose intersection proportion is no less than $\theta_{sp}$, from the head
to the tail of $\Pi(Y)$. Both lines 7-14 of Algorithm 1 and
lines 3-12 of Algorithm 2 run in $O(l)$ time. The time
complexity of GreedyQR (Algorithm 1) is $O(l^2)$. Therefore,
OnlinePerm has time complexity $O(l^4)$. Given some
restrictions on $\theta_{sp}$, e.g., setting $\theta_{sp}(S_i)$ equal to the maximal
intersection proportion over $\forall S_j \in Y_e$, the time complexity of OnlinePerm can be
reduced to $O(l^3)$.

Algorithm 3: Online Permutations (OnlinePerm)

Input: A query $Q_k$; a permutation $\Pi(Y)$; Statistics
  Input $SI$; Intersection Threshold $\theta_{sp}$
Output: The new permutation $\Pi(Y)$ for query $Q_k$

 1  Set the permutation $\Pi(Y)$ as an empty queue;
 2  Set the set $Y_e = Y$;
 3  if $\Pi(Y) == \emptyset$ then
 4      [$\Pi(Y)$, $Y_e$, $v_{avg}$] = GreedyQR($Q$, $k$, $\Pi(Y)$, $Y_e$, $SI$);
 5  end
 6  Execute LockWrite($\Pi(Y)$);
 7  for $i = 1, 2, \ldots, |\Pi(Y)|$ do
 8      Fetch the $i$th source $S_i$ from $\Pi(Y)$;
 9      [$\Pi'(Y)$, $Y'_e$, $v'_{avg}$] = RePerm($S_i$, $Q$, $k$, $\Pi(Y)$, $Y_e$, $v_{avg}$, $SI$, $\theta_{sp}$);
10      if $v'_{avg} > v_{avg}$ then
11          Set $v_{avg} = v'_{avg}$ and $\Pi(Y) = \Pi'(Y)$;
12          Execute LockWrite($\Pi(Y)$);
13      end
14  end
15  return $\Pi(Y)$;

V. Statistics Collection

The statistics inputs $SI$ described in the last section should
be collected from the sources by the querier so as to be taken
as inputs for OnlinePerm. The SC component (Figure 4) can
obtain the access time cost $ta_i$ and the per-tuple transfer time
cost $tr_i$ of $SI$ for each source $S_i \in Y$ by generating queries
on $S_i$. Then, $ta_i$ and $tr_i$ can be calculated from the time cost
of retrieving result tuples, e.g., if it takes 850 ms to retrieve
1000 tuples from $S_i$, we write $ta_i + 1000\,tr_i = 850$ ms.
However, SC cannot precisely predict and pre-detect $|\cap Y|$
of $SI$ for a query $Q_k$ given in real time. Instead, SC
solves this problem in two stages. In the Initial-detection stage,
SC generates a query $Q_{|Y|}$ that retrieves all tuples from all
sources (or samples from all sources), and estimates $|\cap Y|$ for
$Q_{|Y|}$. The query result of any other query can be regarded as
a subset of the result of $Q_{|Y|}$. Thus, in the Online-detection
stage, the estimation of $|\cap Y|$ for a query $Q_k$ given in real time
can be derived both from the $|\cap Y|$ of $Q_{|Y|}$ and from online detection results.
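
A single timing such as $ta_i + 1000\,tr_i = 850$ ms gives one equation in two unknowns, so at least two probes of different sizes are needed to separate $ta_i$ from $tr_i$. The sketch below shows one way to do this under the linear cost model used throughout the paper; the function name `estimate_access_and_transfer` and the second probe (200 tuples in 210 ms) are made up for illustration and do not come from the paper.

```python
def estimate_access_and_transfer(probe_a, probe_b):
    """Solve ta + n * tr = t for (ta, tr) from two probes (n, t in ms)."""
    (n_a, t_a), (n_b, t_b) = probe_a, probe_b
    if n_a == n_b:
        raise ValueError("the two probes must retrieve different numbers of tuples")
    tr = (t_b - t_a) / (n_b - n_a)   # per-tuple transfer time
    ta = t_a - n_a * tr              # access time
    return ta, tr


if __name__ == "__main__":
    # First probe is hypothetical; second is the 1000-tuple, 850 ms example.
    ta, tr = estimate_access_and_transfer((200, 210.0), (1000, 850.0))
    print(ta, tr)
```

With more than two probes, a least-squares fit of the same linear model would be a natural extension.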

A. Initial-detection 

In the Initial-detection stage, SC generates a query $Q_{|Y|}$
to retrieve all tuples from all sources in $Y$. For the sake of
simplicity, we abuse notation and let $|S_i|$ and $|\bar{S}_i|$ denote the
number of tuples in and not in $S_i$ for query $Q_{|Y|}$ respectively.
Combine all $|S_i|$ or $|\bar{S}_i|$, $i = 1, 2, \ldots, l$, as variables in the set $\Omega$,
e.g., (for $l = 3$) $|S_1\bar{S}_2\bar{S}_3|$ as a variable denotes the number
of tuples in $S_1$ and not in $S_2$, $S_3$. Also, let $\Omega_q$ represent the
set of variables that count tuples in $q$ sources and not in the
other $l - q$ sources. For a variable $w_{q+1} \in \Omega_{q+1}$, a variable
$w_q$ is a parent of $w_{q+1}$ if $w_q$ and $w_{q+1}$ are different ("in" or
"not in") in only one source, e.g., $|S_1\bar{S}_2\bar{S}_3| \in \Omega_1$ is a parent
of $|S_1S_2\bar{S}_3| \in \Omega_2$. Let $|S_i|$ be the ancestor of all variables that
consider tuples in $S_i$. As for $|\cap Y|$, given a permutation
$\Pi(Y)$ of sources, the number of intersection tuples $|\cap S_i|$ for
a source $S_i \in Y$ can be estimated by adding up all the
variables that consider tuples both in $S_i$ and in $S_j$ for
any $S_j$ prior to $S_i$ in $\Pi(Y)$.
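
The summation rule for $|\cap S_i|$ can be made concrete with a small sketch. It assumes that every variable of $\Omega$ is already known; the dictionary `REGIONS` and the function `intersection_tuples` are hypothetical names, and the counts reuse the statistics of Example 1 purely as sample values.

```python
# Variables of Omega for l = 3, keyed by the set of sources a tuple is "in".
# Sample values taken from the statistics of Example 1.
REGIONS = {
    frozenset({"S1"}): 10,
    frozenset({"S2"}): 80,
    frozenset({"S3"}): 60,
    frozenset({"S1", "S2"}): 35,
    frozenset({"S1", "S3"}): 5,
    frozenset({"S2", "S3"}): 10,
    frozenset({"S1", "S2", "S3"}): 0,
}


def intersection_tuples(perm, position):
    """Number of intersection tuples of perm[position] with all earlier sources.

    Adds up every variable that counts tuples in this source and in at
    least one source placed before it in the permutation.
    """
    name = perm[position]
    earlier = set(perm[:position])
    return sum(count for region, count in REGIONS.items()
               if name in region and region & earlier)


if __name__ == "__main__":
    perm = ["S1", "S2", "S3"]
    print([intersection_tuples(perm, i) for i in range(len(perm))])
```

For the permutation $S_1S_2S_3$ this gives 0, 35 and 15 intersection tuples for $S_1$, $S_2$ and $S_3$ respectively, which are the values used in the query rates of the greedy example.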




















To derive the estimation of $|\cap Y|$, SC needs to detect
the value of all the variables in $\Omega$. The set $\Omega$ has a total
of $2^l$ variables; the detection complexity of all variables in
$\Omega$ grows exponentially as $l$ increases. It is impossible to
detect all variables when hundreds or even thousands of sources
exist. Therefore, SC applies pruning techniques that (1)
iteratively add variables to the detection set, (2) iteratively
remove variables from the detection set, and (3) estimate the
value of variables by Maximum Entropy [?].

In detail, SC first detects the values of $|S_1|$, ..., $|S_l|$
from the sources, and starts $l$ iterations. Let $W_i$ denote
the detection set containing the variables added in the $i$th iteration.
In the $(i+1)$th iteration, SC first removes the variables whose
values are below a given threshold $\theta_{sc}$ from $W_i$, and queries
the sources for the values of the remaining variables in $W_i$ after
the removal. Then, SC considers the variables in $\Omega_{q+1}$. If a
variable $w \in \Omega_{q+1}$ has $Parent(w) > \theta_{sc}$, the value of $w$
may also be greater than $\theta_{sc}$. SC adds such variables to the
detection set $W_{i+1}$. Suppose that all variables in $W_{i+1}$ have
equal weight; then, according to the principle of Maximum
Entropy, SC can estimate the values of the variables in $W_{i+1}$
by solving the following MaxEnt problem:

$$\begin{cases}
\max: & -\sum_{w \in W_{i+1}} w \log w \\
\text{s.t.} & |S_i| = \sum w, \quad w \in Ancestor(|S_i|) \ \&\&\ w \in W \\
& W = \cup\{W_j, j = 1, 2, \ldots, i+1\}
\end{cases} \qquad (10)$$

By solving (10) with Lagrange multipliers [?], SC can get
the expected value of all variables in $W_{i+1}$. When the loop
is finished, SC can further estimate $|\cap Y|$ for $Q_{|Y|}$ by the
value of all variables in $W$, $W = \cup\{W_j, j = 1, 2, \ldots, l\}$.
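
As a hedged sketch of the Lagrange-multiplier step, consider the generic form of the problem: maximize $-\sum_w w \log w$ over the unknown variables subject to one linear constraint per source. The multipliers $\lambda_i$ and the index sets $A_i$ (the unknown variables appearing in the constraint of $|S_i|$) are notation introduced here for illustration, and already detected variables are treated as constants absorbed into the right-hand sides $c_i$.

```latex
\begin{aligned}
L(w, \lambda) &= -\sum_{w} w \log w
  + \sum_{i=1}^{l} \lambda_i \Bigl( c_i - \sum_{w \in A_i} w \Bigr), \\
\frac{\partial L}{\partial w} &= -\log w - 1 - \sum_{i:\, w \in A_i} \lambda_i = 0
  \quad\Longrightarrow\quad
  w = \exp\Bigl( -1 - \sum_{i:\, w \in A_i} \lambda_i \Bigr).
\end{aligned}
```

Substituting this form back into the $l$ constraints leaves $l$ equations in the $l$ multipliers. In the special case where the $n_i$ variables of $A_i$ appear in no other constraint, they all share the single multiplier $\lambda_i$ and are therefore equal, so each takes the value $c_i / n_i$.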


Example 2. Consider a simple scene of three sources $S_1$, $S_2$ 
and $S_3$, and assume that all variables to be determined have 
values greater than $\theta_{sc}$. SC first queries these three sources 
for the values of $|S_1|$, $|S_2|$ and $|S_3|$. In the 1st iteration, SC 
solves the MaxEnt problem for $W_1$: 

$$
\begin{aligned}
\max:\quad & -\sum_{w_j \in W_1} w_j \log w_j \\
\text{s.t.}\quad & |S_1| = w_1,\quad |S_2| = w_2,\quad |S_3| = w_3
\end{aligned}
\qquad (11)
$$

where $w_1 = |S_1 \bar{S}_2 \bar{S}_3|$, $w_2 = |\bar{S}_1 S_2 \bar{S}_3|$ and 
$w_3 = |\bar{S}_1 \bar{S}_2 S_3|$. By solving (11), SC obtains the estimation of $w_1$, 
$w_2$ and $w_3$. With the assumption that all variables have values 
greater than $\theta_{sc}$, no variable is removed from $W_1$. In the 2nd 
iteration, SC detects the values of the variables $w_1$, $w_2$ and $w_3$, and 
solves the MaxEnt problem for $W_2$: 

$$
\begin{aligned}
\max:\quad & -\sum_{w_j \in W_2} w_j \log w_j \\
\text{s.t.}\quad & |S_1| = w_1 + w_4 + w_5,\quad |S_2| = w_2 + w_6 + w_7 \\
& |S_3| = w_3 + w_8 + w_9
\end{aligned}
\qquad (12)
$$

where $w_4 = w_6 = |S_1 S_2 \bar{S}_3|$, $w_5 = w_9 = |S_1 \bar{S}_2 S_3|$ and 
$w_7 = w_8 = |\bar{S}_1 S_2 S_3|$. In the 3rd iteration, SC solves the MaxEnt problem 
for $W_3 = \{|S_1 S_2 S_3|\}$. Given a permutation $S_1 S_2 S_3$, the number 
of intersection tuples for any source can be estimated, 
e.g. $|\cap S_3| = |\bar{S}_1 S_2 S_3| + |S_1 \bar{S}_2 S_3| + |S_1 S_2 S_3|$. 
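
The last step of Example 2 can be sketched in code: once the size of every region is known or estimated, the number of intersection tuples of a source is the sum of the regions that contain it together with at least one earlier source in the permutation. This is a minimal sketch; the function name `intersection_counts`, the `frozenset` encoding of regions and the sample sizes are assumptions made for illustration and do not come from the paper.

```python
def intersection_counts(region_sizes, permutation):
    """Return, for each source, how many of its tuples already appear
    in a source placed earlier in the permutation.

    region_sizes maps a frozenset of source names to the number of tuples
    that lie in exactly those sources (hypothetical encoding).
    """
    counts = {}
    seen = set()  # sources placed earlier in the permutation
    for src in permutation:
        counts[src] = sum(
            size
            for members, size in region_sizes.items()
            if src in members and members & seen
        )
        seen.add(src)
    return counts


# Made-up region sizes for three sources S1, S2, S3.
regions = {
    frozenset({"S1"}): 50,
    frozenset({"S2"}): 30,
    frozenset({"S3"}): 20,
    frozenset({"S1", "S2"}): 10,
    frozenset({"S1", "S3"}): 5,
    frozenset({"S2", "S3"}): 4,
    frozenset({"S1", "S2", "S3"}): 1,
}
counts = intersection_counts(regions, ["S1", "S2", "S3"])
```

For `S3` the sum covers the regions shared with `S1` only, with `S2` only, and with both, which mirrors the three terms of $|\cap S_3|$ in the example.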

The formal description of the Initial-detection algorithm is 
omitted here due to the space limit. The Initial-detection algorithm runs 
a loop of $l$ iterations; the $i$th iteration solves a MaxEnt problem 
with $|W_i|$ variables and $l$ constraints, $i = 1, 2, \ldots, l$. 
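
The add/remove pruning loop of Initial-detection resembles level-wise candidate generation. The sketch below is one possible reading, under stated assumptions: each variable is encoded as a `frozenset` of source names, a variable's parents are taken to be the sets with one source removed, and a candidate is added only when all of its parents survived the threshold. The function names and the sample values are hypothetical.

```python
from itertools import combinations


def prune(values, threshold):
    """Keep only the variables whose detected value exceeds the threshold."""
    return {var for var, value in values.items() if value > threshold}


def next_detection_set(kept):
    """Build the next detection set from the surviving variables.

    Two surviving variables of the same size are joined when they differ in
    exactly one source; the joined variable is kept as a candidate only if
    every parent (the candidate minus one source) survived as well.
    """
    kept = set(kept)
    candidates = set()
    for a, b in combinations(kept, 2):
        union = a | b
        if len(union) == len(a) + 1 and all(union - {s} in kept for s in union):
            candidates.add(union)
    return candidates


# Made-up detected values for the single-source variables.
detected = {
    frozenset({"S1"}): 100,
    frozenset({"S2"}): 80,
    frozenset({"S3"}): 3,
}
survivors = prune(detected, threshold=5)
w_next = next_detection_set(survivors)
```

Here `S3` falls below the threshold, so no variable involving `S3` is ever generated, which is how the exponential number of variables is avoided in this reading.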


B. Online-detection 

After the Initial-detection stage, an estimation of $|\cap Y|$ 
for $Q_{|Y|}$ has already been established. Then, a permutation 
$\Pi(Y) = S_{(1)} S_{(2)} \cdots S_{(l)}$ of all sources can be constructed by 
OnlinePerm based on $|\cap Y|$. Upon receiving a real-time query 
$Q_k$ of $k$ tuples, SC starts the Online-detection stage, and derives 
the estimation of $|\cap Y|$ for $Q_k$ from the Initial-detection estimation 
on the fly, simultaneously with the query execution. 

The Online-detection stage can be divided into two sub-stages. 
In the first sub-stage, the numbers of tuples $|S_{(i)}|$, $i = 1, 2, \ldots, l$, 
for $Q_k$ are detected sequentially following the permutation 
$\Pi(Y)$. When receiving the partial results $|S_{(1)}|, |S_{(2)}|, \ldots, |S_{(q)}|$, 
SC estimates the numbers of tuples for the other sources: 

l%)l = ^E§^’ z = g + l,g + 2,...,Z (13) 


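Rewrite the MaxEnt problem as follows: 

$$
\begin{aligned}
\max:\quad & -\sum_{w \in W} w \log w \\
\text{s.t.}\quad & |S_{(i)}| = \sum w,\quad w \in Ancestor(|S_{(i)}|) \;\&\&\; w \in W \\
& W = \bigcup \{W_j,\ j = 1, 2, \ldots, l\}
\end{aligned}
\qquad (14)
$$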

The MaxEnt problem in the Online-detection stage considers 
all variables introduced in the Initial-detection stage. By 
solving (14), SC obtains the expected values of all variables in 
$W$, and thus estimates $|\cap Y|$ for any given permutation $\Pi(Y)$. 
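
A MaxEnt problem of this shape (maximize $-\sum w \log w$ subject to sums of variables matching source sizes) can be solved numerically in several ways. The sketch below uses iterative proportional scaling, which is a swapped-in technique chosen for brevity and not the paper's solver. It assumes every constraint total is positive and the constraints are consistent; the function name and the two-source example are hypothetical. Starting every variable at $e^{-1}$ and only rescaling keeps each variable in the exponential form that the stationarity condition of the Lagrangian requires.

```python
import math


def maxent_iterative_scaling(constraints, tol=1e-9, max_sweeps=10000):
    """Approximate the maximizer of -sum(w log w) under sum constraints.

    constraints is a list of (variable_names, total) pairs, meaning the
    listed variables must sum to total (assumed positive and consistent).
    """
    names = {v for variables, _ in constraints for v in variables}
    # e^-1 maximizes -w log w for a single unconstrained variable.
    w = {v: math.exp(-1.0) for v in names}
    for _ in range(max_sweeps):
        worst = 0.0
        for variables, total in constraints:
            current = sum(w[v] for v in variables)
            worst = max(worst, abs(current - total))
            factor = total / current
            for v in variables:
                w[v] *= factor  # rescale so this constraint holds exactly
        if worst < tol:
            break
    return w


# Two sources of 10 tuples each: "a" and "b" are the exclusive regions,
# "ab" is the shared region that appears in both constraints.
solution = maxent_iterative_scaling([(["a", "ab"], 10.0), (["b", "ab"], 10.0)])
```

In this two-source case the result satisfies both size constraints and the product relation `ab = e * a * b`, which is what the exponential form implies when `ab` carries both multipliers.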

In the second sub-stage, all the results of $|S_{(1)}|, |S_{(2)}|, \ldots, 
|S_{(l)}|$ have been received. SC sorts all variables $w \in W$ by 
$|w - \hat{w}|$ in descending order, where $w$ is the current expected value 
and $\hat{w}$ denotes the value of $w$ estimated in the Initial-detection stage. 
Then, SC sequentially detects the values of these variables 
in $W$ following this order. Upon receiving partial results of 
the detection, SC re-solves the MaxEnt problem of (14) to update the 
estimation of $|\cap Y|$ for $Q_k$. The Online-detection is terminated when 
the query execution of $Q_k$ is finished or all the results of 
detection have been received. 
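
The ordering rule of the second sub-stage can be written in a few lines. This is a minimal sketch that assumes the current expected values and the Initial-detection estimates are held in two dictionaries keyed by variable name; the function name and the numbers are made up for illustration.

```python
def detection_order(expected, initial):
    """Order variables by how far the current expected value has drifted
    from the Initial-detection estimate, largest drift first."""
    return sorted(
        expected,
        key=lambda var: abs(expected[var] - initial[var]),
        reverse=True,
    )


expected = {"w1": 10.0, "w2": 4.0, "w3": 7.0}
initial = {"w1": 9.0, "w2": 8.0, "w3": 7.0}
order = detection_order(expected, initial)
```

Detecting the most-drifted variables first spends the limited detection time on the estimates that are most likely to be wrong.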


The formal description of the Online-detection algorithm is 
omitted here due to the space limit. The Online-detection algorithm 
re-solves the MaxEnt problem at most $|W|$ times; in 
the $i$th time, it estimates the expected values of no more than 
$|W| - i + 1$ variables. 








VI. Online Query Strategy 

The Online Query Framework is shown in Fig. 6. It 
applies a dynamic strategy that enables the execution of the 
query on sources and the improvement of the source permutation 
to work in parallel. 

At the beginning, SC generates a query $Q_{|Y|}$ that retrieves 
all tuples from all sources (or samples from all sources), 
and starts the Initial-detection process to probe and estimate 
$|\cap Y|$. Then, SP applies the OnlinePerm algorithm to construct a 
permutation $\Pi(Y)$ for all sources based on the estimation of 
$|\cap Y|$. 

Upon receiving a real-time query $Q_k$, the processes in SC, 
SP and QE start to work simultaneously. In detail, SC starts 
the Online-detection process to derive the estimation of $|\cap Y|$ for $Q_k$ 
from both the results of online detection and the Initial-detection 
estimation of $|\cap Y|$. SP repeatedly runs the OnlinePerm algorithm on 
the un-queried sources with the continuously updated estimation 
provided by SC; at the end of each run, the OnlinePerm 
algorithm writes the newly constructed permutation $\Pi(Y)$ to 
the shared memory for the query $Q_k$. Simultaneously, QE 
reads the permutation $\Pi(Y)$ from the shared memory, and 
starts query threads to retrieve the result tuples sequentially 
from the sources following $\Pi(Y)$. When the querier has 
received $k$ tuples or all sources have been queried, QE sends 
a signal to terminate the running processes in SC and SP. 
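
The interplay between SP and QE can be illustrated with two threads sharing a list of un-queried sources. This is a simplified sketch rather than the system described in the paper: SC is not modeled, the hypothetical `score` callback stands in for the statistics-driven ranking, sources are in-memory lists instead of remote databases, and all names are invented for the example.

```python
import threading


def run_online_query(sources, k, score):
    """Retrieve up to k distinct tuples while the permutation keeps improving.

    sources maps a source name to its list of tuples; score(name, tuples)
    returns a rank value, higher meaning the source is queried earlier.
    """
    lock = threading.Lock()
    stop = threading.Event()
    pending = list(sources)  # shared permutation of un-queried sources
    results = set()

    def source_permutation():
        # SP: keep re-ranking the un-queried sources until QE signals the end.
        while not stop.is_set():
            with lock:
                pending.sort(key=lambda n: score(n, sources[n]), reverse=True)
            stop.wait(0.001)

    def query_execution():
        # QE: always take the source currently at the head of the permutation.
        while True:
            with lock:
                if len(results) >= k or not pending:
                    break
                name = pending.pop(0)
            fetched = sources[name]  # stands in for a remote query
            with lock:
                results.update(fetched)
        stop.set()  # terminate SP, as QE does in the framework

    threads = [
        threading.Thread(target=source_permutation),
        threading.Thread(target=query_execution),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(results)[:k]


demo_sources = {"a": [1, 2, 3], "b": [3, 4], "c": [5, 6, 7, 8]}
answer = run_online_query(demo_sources, k=5, score=lambda n, tuples: len(tuples))
```

The point of the design is that QE never waits for a complete permutation: it only needs the current head of the shared list, so ranking and retrieval overlap in time.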

VII. Experiment Results 

We conduct experiments on independent sources that are 
self-controlled and only provide a query interface for others. 
Each source consists of tuples of educational institutions, as 
shown in Fig. 7, crawled from web sites with a random 
start seed. In the experiments, we divide the tuples into two sets, $E_1$ 
and $E_2$, with $E = E_1 \cup E_2$ and $E_1 \cap E_2 = \emptyset$, so each tuple 
belongs to either $E_1$ or $E_2$. We use the query $Q_{|Y|}$ of SELECT * 
FROM E for the Initial-detection stage of SC, and the query 
$Q_k$ of SELECT top k tuples FROM $E_1$ to evaluate the 
time cost of the implemented algorithms. 

For evaluation, we implemented eight algorithms: 

• Random: Randomly choose a permutation of sources. 

• MaxT: Select the source $S$ with the maximal number of tuples 
$|S|$ each time, without considering the intersection between 
sources. 

• MaxRT: Select the source $S$ with the maximal number of residual 
tuples $|S| - |\cap S|$ each time; compared with MaxT, it considers 
the intersection. 

• MinT: Select the source $S$ with the minimal per-tuple 
retrieve time each time, without considering the 
intersection. 

• MinRT: Select the source $S$ with the minimal residual 
per-tuple retrieve time each time; compared with MinT, it considers 
the intersection. 

• SeqPerm: apply the sequential strategy that does not start the 
query execution until RePerm (Algorithm 1) finishes. 
SeqPerm incurs the additional time cost of permutation 
construction, but may obtain a better permutation than 
OnlinePerm. 

• OnlinePerm: apply the online strategy that lets all the 
components work in parallel. OnlinePerm does not consider 
the sources that have been queried, and only constructs the 
permutation for un-queried sources dynamically without 
waiting for RePerm to finish. 

• FullKnowledge: apply the permutation constructed by 
OnlinePerm with the precise intersection statistics $|\cap Y|$ as 
input. Its performance can be considered as the upper 
bound that can be achieved, although the permutation may not 
be optimal (Theorem 1). 

Fig. 6. Online Query Framework 

Name: Institute of Computing Technology, CAS 
Beginning: 1956 
Homepage: www.ict.ac.cn 
Address: No. 6 Kexueyuan South Road Zhongguancun 
Postcode: 100190 
Tel: (8610)62601166 
Email: office@ict.ac.cn 

Fig. 7. Example tuple 
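
The difference between the MaxT and MaxRT baselines can be made concrete with a small sketch. It assumes, for illustration only, that the exact tuple sets of all sources are available locally (in the experiments the intersection has to be estimated); the function names and the toy data are hypothetical.

```python
def max_t(sources):
    """MaxT: order sources by their tuple count, ignoring overlap."""
    return sorted(sources, key=lambda s: len(sources[s]), reverse=True)


def max_rt(sources):
    """MaxRT: repeatedly pick the source with the most residual tuples,
    i.e. tuples not yet covered by the sources chosen so far."""
    remaining = {name: set(tuples) for name, tuples in sources.items()}
    covered = set()
    order = []
    while remaining:
        best = max(remaining, key=lambda s: len(remaining[s] - covered))
        order.append(best)
        covered |= remaining.pop(best)
    return order


toy = {"A": {1, 2, 3, 4}, "B": {1, 2, 3}, "C": {5, 6}}
order_t = max_t(toy)
order_rt = max_rt(toy)
```

On this toy input MaxT places `B` second because it is large, while MaxRT defers it because every tuple of `B` is already covered by `A`.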

All the algorithms were implemented in Java (JDK 1.7), 
and the experiments ran on a Data Integration System (DIS) of 
our own implementation. DIS can manipulate remote relational 
databases or shared folders to create a new source. DIS is built 
on 4 physical machines. By default, DIS has 2035 sources with 
a total of 501760 tuples on these machines, and each tuple is 
limited to a maximal size of 1205. Among these tuples, 
there are 24860 distinct tuples in all. We observed that the 
access time cost of any source in DIS is in [477, 2350] ms, and 
the per-tuple transfer time cost is in [0.02, 0.42] ms. The querier 
was implemented on a Windows 7 machine with a 2.3GHz Intel 
Core i7 CPU, 8GB RAM and a Gigabit Ethernet Controller. 
Both the querier and DIS are in the same LAN. For simplicity, 
we abuse the notation and let $|Y_1|$, $|Y_2|$ and $|Y|$ denote the 
number of tuples in $E_1$, $E_2$ and $E$ respectively. By default, 
we set $|Y_1| = |Y_2| = 0.5|Y| = 12430$, $k = 0.8|Y_1|$, the 
intersection threshold $\theta_{sp} = 0.05$, the detection threshold 
$\theta_{sc} = 0.005$, one thread running for the query execution and 
one thread running for the Online-detection process of $Q_k$ in the 
querier. We next evaluate the query time cost of the implemented 
algorithms under different factors. 

A. Varying k 

TABLE I 
Experiment results* 

Condition                 Random      MaxT     MaxRT      MinT     MinRT   SeqPerm  OnlinePerm  FullKnowledge
k = 0.2|Y_1|             38751.6   30478.7   26850.1   13807.9   11058.3   16617.4     10735.3        10528.6
k = 0.4|Y_1|             78872.4   65944.0   60727.7   32861.0   27899.4   31831.5     25709.8        24147.1
k = 0.6|Y_1|            133642.2  120663.7  109618.4   59887.5   52291.0   51415.3     47623.3        43979.9
k = 0.8|Y_1|            236555.6  209363.2  195653.2  114498.3   96686.0   90179.0     85064.9        81170.8
2 query threads         123385.5  111151.4  104554.5   60785.5   50789.2   52086.0     45509.7        42417.1
3 query threads          81871.1   74074.4   68009.5   38940.5   33843.2   35530.6     30339.8        28278.0
4 query threads          59615.4   54455.0   50029.3   30891.0   24287.4   27836.0     22754.8        21208.5
5 query threads          48772.5   41979.7   40106.5   24367.9   19719.2   23911.2     18203.8        17266.8
2000 sources            238312.2  208382.6  195475.3  114455.9   96382.1   91163.9     84995.8        81206.8
3000 sources            357818.7  300007.3  266163.6  169115.3  140972.3  131285.0    124255.4       118620.9
4000 sources            478313.0  393498.0  340143.2  221204.8  181479.6  167078.1    159867.4       152786.7
5000 sources            563392.5  481649.7  419245.2  273395.2  220551.2  202920.4    195332.8       186865.3
|Y_1| = 0.2|Y|          236438.5  232865.6  229671.7  149629.8  120992.3  103944.0     96441.3        81350.1
|Y_1| = 0.4|Y|          229253.9  217808.5  205961.6  133968.8  102889.0   95704.7     88965.9        80906.4
|Y_1| = 0.6|Y|          245076.4  202250.5  190735.5  107773.1   92073.5   89679.4     84411.5        81405.5
|Y_1| = 0.8|Y|          232241.0  191382.1  180240.4  103067.2   89814.8   87802.6     83070.8        80690.8
1.2 times overhead      237187.9  215087.3  201473.8  129091.8  112387.0  105885.1     93649.4        81543.8
1.4 times overhead      235221.0  223254.1  214012.1  145859.1  123970.6  123532.7    109257.6        82717.9
1.6 times overhead      235069.7  235806.8  227458.2  161720.6  132491.4  141180.2    124865.8        82069.3
1.8 times overhead      236087.7  235286.4  229887.0  164127.1  143741.9  158827.7    140474.1        81410.6

* All values are in milliseconds (ms). 

We first compared the algorithms under varying $k$ with 
the default settings. Rows 2-5 of Table I 
show the query time cost of the algorithms for 
$k = 0.2|Y_1|$, $0.4|Y_1|$, $0.6|Y_1|$ and $0.8|Y_1|$ respectively. We have 
the following observations. 

First, Random performs worst among these algorithms. By 
considering the number of tuples in sources, MaxT has slightly 
less time cost than Random; by further considering the number 
of intersection tuples between sources, MaxRT performs better 
than MaxT. By selecting the sources with less per-tuple time 
cost, MinT is faster than MaxRT, and by considering the 
intersection in the per-tuple time cost, MinRT has an even faster 
query rate than MinT. 

Second, SeqPerm has a significant initial time cost for 
permutation construction and performs worse than MinRT at the 
beginning. As shown in Table I, SeqPerm has approximately 
5.6s (or seconds) and 3.9s more time cost than MinRT for 
$k = 0.2|Y_1|$ and $k = 0.4|Y_1|$ respectively. Then, SeqPerm 
quickly catches up with MinRT and performs better than MinRT; 
SeqPerm has approximately 0.9s and 6.5s less time cost 
than MinRT for $k = 0.6|Y_1|$ and $k = 0.8|Y_1|$ respectively. 
The results of SeqPerm show the effectiveness of the RePerm 
algorithm, which constructs a better permutation than MinRT. 

Third, OnlinePerm has roughly the same performance as 
MinRT at the beginning, and as $k$ increases, OnlinePerm has 
clearly less time cost than MinRT. This is because the 
effect of the intersection between sources is small or non-existent 
when the number of tuples to be retrieved, $k$, is small, and 
the effect of the intersection appears when $k$ is increased. As 
shown in Table I, OnlinePerm and MinRT have approximate 
time costs of 10.7s and 11.1s for $k = 0.2|Y_1|$ respectively, and 
OnlinePerm takes approximately 85.1s to retrieve $k = 0.8|Y_1|$ 
tuples while MinRT takes approximately 96.7s. 

Finally, as can be seen from Table I, OnlinePerm has a consistently 
lower time cost (4-6s) than SeqPerm for $k = 0.2|Y_1|$, $0.4|Y_1|$, 
$0.6|Y_1|$ and $0.8|Y_1|$, which shows that the parallel 
execution of the processes in the SP and QE components does not 
reduce (or has little effect on) the quality of the permutation 
constructed by RePerm; OnlinePerm has slightly more time cost 
(0-4s) than FullKnowledge, which shows the efficiency of the 
processes in the SC component that collect statistics online 
for source permutation. 
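
The gaps quoted in these observations can be recomputed directly from Table I. The snippet below copies the four relevant columns for the varying-$k$ rows; the dictionary layout and the helper name `gap_seconds` are choices made for this illustration.

```python
# Query time cost in ms for varying k, copied from Table I.
TABLE_I_VARYING_K = {
    "0.2": {"MinRT": 11058.3, "SeqPerm": 16617.4, "OnlinePerm": 10735.3, "FullKnowledge": 10528.6},
    "0.4": {"MinRT": 27899.4, "SeqPerm": 31831.5, "OnlinePerm": 25709.8, "FullKnowledge": 24147.1},
    "0.6": {"MinRT": 52291.0, "SeqPerm": 51415.3, "OnlinePerm": 47623.3, "FullKnowledge": 43979.9},
    "0.8": {"MinRT": 96686.0, "SeqPerm": 90179.0, "OnlinePerm": 85064.9, "FullKnowledge": 81170.8},
}


def gap_seconds(row, slower, faster):
    """Time-cost difference in seconds between two algorithms for one row."""
    costs = TABLE_I_VARYING_K[row]
    return (costs[slower] - costs[faster]) / 1000.0


for row in TABLE_I_VARYING_K:
    print(
        row,
        round(gap_seconds(row, "SeqPerm", "OnlinePerm"), 1),
        round(gap_seconds(row, "OnlinePerm", "FullKnowledge"), 1),
    )
```

Each printed line shows, for one value of $k$, how far OnlinePerm is ahead of SeqPerm and how far it is behind the FullKnowledge bound.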

B. Varying Query Threads 

We compared the algorithms on varying query threads. 
Rows 6-9 of Table I show the query time costs of the 
algorithms for $k = 0.8|Y_1|$ when running 2, 3, 4 and 5 query 
threads at the same time respectively. As can be seen, when 
more query threads are running, (1) the query time cost of the 
algorithms decreases fairly fast, (2) the decrease rate of the query 
time cost slows down, as is usually observed in parallel systems, 
and (3) the advantage of OnlinePerm on the performance 
becomes less apparent, e.g., OnlinePerm spends approximately 
18.2s on query execution compared with 19.7s by MinRT when 5 query 
threads are running. The performance of OnlinePerm reveals 
that the benefit of Online-detection for statistics collection is 
offset by the increased number of running query threads. 
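
The thread-scaling behaviour can be quantified from Table I. The sketch below copies the OnlinePerm and MinRT columns for 1 to 5 query threads (the 1-thread values are the default $k = 0.8|Y_1|$ row) and derives the speedup and the absolute gap; the variable and function names are chosen for this example.

```python
# Query time cost in ms by number of query threads, copied from Table I.
ONLINE_PERM_MS = {1: 85064.9, 2: 45509.7, 3: 30339.8, 4: 22754.8, 5: 18203.8}
MIN_RT_MS = {1: 96686.0, 2: 50789.2, 3: 33843.2, 4: 24287.4, 5: 19719.2}


def speedup(times):
    """Speedup of each thread count relative to the single-thread run."""
    base = times[1]
    return {threads: base / cost for threads, cost in times.items()}


def gap_ms(slower, faster):
    """Absolute time-cost gap between two algorithms per thread count."""
    return {threads: slower[threads] - faster[threads] for threads in slower}


online_speedup = speedup(ONLINE_PERM_MS)
advantage = gap_ms(MIN_RT_MS, ONLINE_PERM_MS)
```

The `advantage` values shrink from roughly 11.6s with one thread to roughly 1.5s with five, which is the narrowing described in observation (3).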

C. Varying Sources 

We compared the algorithms on varying sources, and let the 
number of sources be 2000, 3000, 4000 and 5000 respectively 
by disabling or creating sources in DIS for evaluation. Then, 
we kept the number of distinct tuples and the total number of 
tuples at the default settings by moving tuples between sources. 
The evaluation results of varying the number of sources for 
$k = 0.8|Y_1|$ are shown in rows 10-13 of Table I. As can be 
seen, with more sources involved, (1) the query time cost of the 
algorithms increases, (2) the query time cost of OnlinePerm 
is less than that of the other algorithms, and (3) the query time cost of 
OnlinePerm increases linearly and is close to the time 
cost of FullKnowledge, which shows the scalability of our 
online query system. 

D. The Effect of Pruning Techniques 

We applied the pruning techniques introduced in Section V 
for the estimation of $|\cap Y|$ based on the statistics collected by 
the SC component. With a huge number of variables removed 
by the pruning techniques, SC only needs to solve a MaxEnt 
problem of 5137 variables at the beginning of the 
Online-detection stage (and the number of variables decreases 
as the Online-detection process proceeds). By converting 
the MaxEnt problem to a sparse system of linear equations, 
SC estimated the values of the 5137 variables in 2.7s on average, as 
measured. In contrast, the basic approach without 
the pruning techniques would have to solve a MaxEnt problem of $2^{2035}$ 
variables, which is computationally infeasible. 
Additionally, we measured the 
case of only 15 sources. In this case, the basic approach spent 
18.7s to solve the MaxEnt problem while our approach with the pruning 
techniques finished in milliseconds. 

E. The Error of Initial-detection Statistics 

We compared the algorithms on errors of the Initial-detection 
statistics. The statistics estimation for $Q_k$ is more accurate with 
a higher ratio $|Y_1|/|Y|$, and we set $|Y_1| = 0.2|Y|$, $0.4|Y|$, $0.6|Y|$ and 
$0.8|Y|$ respectively for evaluation. The evaluation results for 
$k = 0.8|Y_1|$ are shown in rows 14-17 of Table I. As can be 
observed, with higher $|Y_1|$, (1) the query time cost of the 
algorithms except Random decreases, and in particular (2) the 
decrease rate of the query time cost of OnlinePerm is small, e.g., 
OnlinePerm spends approximately 84.4s for $|Y_1| = 0.6|Y|$ 
and 83.1s for $|Y_1| = 0.8|Y|$. The result shows 
that OnlinePerm has stable performance under errors of the 
Initial-detection statistics. 

F. The Overhead of Online-detection 

We compared the algorithms on overheads of 
Online-detection by adding cycle time and setting the detection time 
cost to be 1.2, 1.4, 1.6 and 1.8 times the original detection time 
cost. The results are shown in rows 18-21 of Table I. As can 
be observed from the results, with a higher overhead of 
Online-detection, (1) the query time cost of the algorithms except 
Random increases, and (2) OnlinePerm has a high increase rate 
of query time cost although it still performs better than the other 
algorithms. The results reveal that it becomes harder to gain 
benefit from Online-detection as the detection overhead 
increases. In this case, more running threads for 
Online-detection are suggested to offset the effect of the increased 
overhead. 

G. Summary 

We evaluated our online scheduling strategy under the 
condition of various factors. 

• Varying k: OnlinePerm is the fastest among 
the implemented algorithms apart from the FullKnowledge upper bound; 
this evaluation result shows the 
efficiency of our strategy that enables SP, SC and QE 
to work in parallel. 

• Varying query threads: OnlinePerm still performs best 
among all these algorithms, but the benefit of the 
Online-detection of SC is offset by the increased number of running threads. 

• Varying sources: The query time cost of OnlinePerm 
increases linearly and is the lowest among all the algorithms 
when more sources are involved; this evaluation result 
shows that our strategy is scalable. 

• The effect of pruning techniques: By applying the pruning 
techniques, SC can efficiently estimate the statistics of 
sources. 

• The error of Initial-detection statistics: OnlinePerm suffers 
only a low performance degradation when the error of 
Initial-detection increases; this evaluation result shows 
that our strategy is efficient and robust. 

• The overhead of Online-detection: Although OnlinePerm 
still performs best among all these algorithms, it has a 
high performance degradation when the detection overhead 
increases. This result suggests using more detection threads in 
SC to offset the effect of the increased detection overhead. 

VIII. Conclusion and Future Work 

We address the Time-Cost Minimization Problem (TMP) 
and propose an online scheduling strategy in this paper. The 
architecture of online query scheduling mainly contains the three 
components of Source Permutation (SP), Statistics Collection 
(SC) and Query Execution (QE). We prove that it is 
NP-complete to construct an optimal permutation of sources and 
propose the OnlinePerm algorithm, which considers the effect of 
query rate and residual tuples for SP. We present a two-stage 
detection mechanism and apply pruning techniques to avoid 
estimating an exponential number of variables for SC. By 
applying the online scheduling strategy, SP, SC and QE work 
in parallel to reduce the total time cost for the query. The 
experiment results show the efficiency and scalability of our 
scheduling strategy. 

In this paper, we simplify the redundancy problem by only 
considering the repetitive data tuples between sources. In the 
future, we will consider the time cost of data fusion between 
partially overlapping tuples during online query scheduling. 